.Net: Fix TextChunker orphan chunk token counting by MukundaKatta · Pull Request #14013 · microsoft/semantic-kernel

MukundaKatta · 2026-05-15T02:19:20Z

Motivation and Context

TextChunker.ProcessParagraphs used word counts when deciding whether to glue a small final/orphan paragraph back into the previous paragraph. With a custom token counter, that could merge two paragraphs whose actual token count exceeds maxTokensPerParagraph, producing an oversized final chunk.
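The mismatch can be illustrated with a small sketch. This is Python rather than the library's C#, and it is not the actual TextChunker source: `word_count`, `length_counter`, the limit of 20, and the sample strings are all assumptions chosen to show how a word-based check and a custom token counter can disagree about the same merge.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (stand-in for the old word-based check)."""
    return len(text.split())


def length_counter(text: str) -> int:
    """Hypothetical custom token counter: one token per character."""
    return len(text)


max_tokens = 20                    # assumed limit, for illustration only
previous = "alpha beta gamma"      # 3 words, 16 characters
orphan = "delta"                   # 1 word, 5 characters

# Word-based check: 3 + 1 = 4 words, which looks well within the limit.
words_fit = word_count(previous) + word_count(orphan) <= max_tokens

# Counter-based check on the text that would actually be emitted.
merged = f"{previous} {orphan}"
tokens_fit = length_counter(merged) <= max_tokens

print(words_fit, tokens_fit, length_counter(merged))
```

Here the word-based check approves the merge while the merged text is 22 tokens under the custom counter, which is the kind of oversized final chunk described above.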

Description

This changes the orphan merge check to build the candidate merged paragraph and evaluate it with GetTokenCount(...), so the same token-counting logic controls both splitting and final orphan gluing. It also adds a regression test using a custom length-based token counter where the previous word-count check would have produced an oversized merged chunk.
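A minimal sketch of that approach, written in Python for illustration rather than taken from the C# change itself: `merge_orphan` is a hypothetical helper, and it leaves out whatever threshold the real code uses to decide that the final paragraph is small enough to count as an orphan. It only shows the idea of building the merged candidate first and measuring it with the caller's counter.

```python
from typing import Callable, List

TokenCounter = Callable[[str], int]


def merge_orphan(paragraphs: List[str], max_tokens: int,
                 token_counter: TokenCounter) -> List[str]:
    """Glue the last paragraph onto the previous one only if the result fits.

    Hypothetical helper: the merged candidate is built first and then measured
    with the same counter that governs splitting.
    """
    if len(paragraphs) < 2:
        return list(paragraphs)

    candidate = f"{paragraphs[-2]} {paragraphs[-1]}"
    if token_counter(candidate) <= max_tokens:
        # The merged text fits, so replace the last two paragraphs with it.
        return [*paragraphs[:-2], candidate]

    # The merged text would exceed the limit, so keep the orphan separate.
    return list(paragraphs)


# Usage with a length-based counter (one token per character).
kept_apart = merge_orphan(["alpha beta gamma", "delta"], 20, len)
glued = merge_orphan(["alpha beta gamma", "delta"], 22, len)
print(kept_apart)
print(glued)
```

With a limit of 20 the 22-character candidate is rejected and both paragraphs are kept; raising the limit to 22 allows the merge.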

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the SK Contribution Guidelines and the pre-submission formatting script raises no violations
All unit tests pass, and I have added new tests where possible
I didn't break anyone :)

Local verification: `git diff --check` passes. I could not run `dotnet test dotnet/src/SemanticKernel.UnitTests/SemanticKernel.UnitTests.csproj --filter FullyQualifiedName~TextChunkerTests --no-restore` because this environment does not have the dotnet CLI installed.

Copilot

Pull request overview

Fixes a bug in TextChunker.ProcessParagraphs where the “orphan paragraph” merge decision used word counts instead of the configured token-counting logic, which could produce a merged paragraph exceeding maxTokensPerParagraph when a custom tokenCounter is supplied.

Changes:

Update orphan-merge logic to evaluate the merged candidate using GetTokenCount(...) (consistent with the rest of the splitting flow).
Remove the now-unused s_spaceChar constant from TextChunker.
Add a regression unit test using a length-based custom token counter to ensure orphan chunks are not merged beyond the token limit.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| dotnet/src/SemanticKernel.Core/Text/TextChunker.cs | Uses token counting (via `GetTokenCount`) to validate orphan-paragraph merges, preventing oversized merged chunks with custom token counters. |
| dotnet/src/SemanticKernel.UnitTests/Text/TextChunkerTests.cs | Adds a regression test covering the orphan-merge scenario with a custom token counter. |

github-actions

Automated Code Review

Reviewers: 4 | Confidence: 92% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach

Automated review by MukundaKatta's agents

Fix TextChunker orphan chunk token counting

4af7a3d

moonbox3 added the ".NET" (Issue or Pull requests regarding .NET code) and "kernel" (Issues or pull requests impacting the core kernel) labels on May 15, 2026

github-actions Bot changed the title from "Fix TextChunker orphan chunk token counting" to ".Net: Fix TextChunker orphan chunk token counting" on May 15, 2026

Copilot AI reviewed May 15, 2026

github-actions Bot reviewed May 15, 2026

MukundaKatta wants to merge 1 commit into microsoft:main from MukundaKatta:codex/textchunker-token-overlap

3 participants
